Unsupervised language filtering using the latent Dirichlet allocation
Authors
Abstract
To automatically build, from scratch, the language processing component of a speech synthesis system in a new language, a purified text corpus is needed in which any words and phrases from other languages are clearly identified or excluded. When using found data, with no inherent linguistic knowledge of the language or languages it contains, identifying the pure data is a difficult problem. We propose an unsupervised language identification approach based on Latent Dirichlet Allocation, taking raw n-gram counts as features without any smoothing, pruning, or interpolation. The Latent Dirichlet Allocation topic model is reformulated for the language identification task, and Collapsed Gibbs Sampling is used to train an unsupervised language identification model. We show that such a model is highly capable of identifying the primary language in a corpus and filtering out the other languages present.
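A minimal sketch of the approach the abstract describes: documents are represented by raw character n-gram counts (no smoothing or pruning), and a collapsed Gibbs sampler infers per-document topic proportions, with topics playing the role of languages. The tokenizer, hyperparameters, and corpus below are illustrative assumptions, not the authors' implementation.

```python
import random
from collections import Counter

def char_ngrams(text, n=3):
    """Raw character n-gram counts: no smoothing, pruning, or interpolation."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def lda_gibbs(docs, K=2, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns per-document topic counts; a document's dominant topic is
    the argmax of its row.
    """
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    V = len(vocab)
    widx = {w: i for i, w in enumerate(vocab)}
    ndk = [[0] * K for _ in docs]       # document-topic counts
    nkw = [[0] * V for _ in range(K)]   # topic-word counts
    nk = [0] * K                        # total words per topic
    z = []
    for d, doc in enumerate(docs):      # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1
            nkw[k][widx[w]] += 1
            nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k, wi = z[d][i], widx[w]
                ndk[d][k] -= 1; nkw[k][wi] -= 1; nk[k] -= 1
                # collapsed full conditional p(z_i = t | z_-i, w)
                weights = [(ndk[d][t] + alpha) * (nkw[t][wi] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                r = rng.random() * sum(weights)
                for t, wt in enumerate(weights):
                    r -= wt
                    if r <= 0:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wi] += 1; nk[k] += 1
    return ndk
```

In a filtering setting, the topic that dominates the corpus overall would be taken as the primary language, and documents whose dominant topic differs would be excluded.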
Similar resources
Unsupervised language model adaptation using latent semantic marginals
We integrated the Latent Dirichlet Allocation (LDA) approach, a latent semantic analysis model, into an unsupervised language model adaptation framework. We adapted a background language model by minimizing the Kullback-Leibler divergence between the adapted model and the background model, subject to a constraint that the marginalized unigram probability distribution of the adapted model is equal t...
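The constrained KL minimization this snippet describes is commonly solved in the style of unigram marginal adaptation: each background unigram probability is rescaled by the ratio of the LDA-derived marginal to the background marginal, then renormalized. A minimal sketch under that assumption; the function name, dict-based distributions, and the tuning exponent `mu` are illustrative.

```python
def adapt_unigram(p_background, p_lda, mu=0.5):
    """Rescale a background unigram LM toward an LDA-derived marginal.

    Each probability is multiplied by (p_lda(w) / p_bg(w)) ** mu and the
    result renormalized; mu in [0, 1] controls adaptation strength.
    """
    scaled = {w: p * (p_lda[w] / p) ** mu for w, p in p_background.items()}
    z = sum(scaled.values())
    return {w: v / z for w, v in scaled.items()}
```

With `mu = 0` the background model is returned unchanged; with `mu = 1` the unigram distribution matches the LDA marginal exactly.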
Latent Dirichlet Allocation with Topic-in-Set Knowledge
Latent Dirichlet Allocation is an unsupervised graphical model which can discover latent topics in unlabeled data. We propose a mechanism for adding partial supervision, called topic-in-set knowledge, to latent topic modeling. This type of supervision can be used to encourage the recovery of topics which are more relevant to user modeling goals than the topics which would be recovered otherwise...
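One common way to realize this kind of partial supervision is to mask the Gibbs full conditional so that a constrained word can only be assigned topics from its permitted set. A minimal sketch of that masking step, assuming the unnormalized sampling weights come from a sampler like the one the surrounding papers describe; the function name is illustrative.

```python
def restrict_to_set(weights, allowed):
    """Zero out sampling weights for topics outside a word's allowed set.

    With topic-in-set knowledge, the (unnormalized) Gibbs conditional for a
    constrained word is masked so only permitted topics can be drawn.
    """
    return [w if t in allowed else 0.0 for t, w in enumerate(weights)]
```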
Spoken Language Recognition in the Latent Topic Simplex
This paper proposes the use of latent topic modeling for spoken language recognition, where a topic is defined as a discrete distribution over phone n-grams. The latent topics are trained in an unsupervised manner using the latent Dirichlet allocation (LDA) technique. Language recognition is then performed in a low dimensional simplex defined by the latent topics. We apply the Bhattacharyya mea...
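The Bhattacharyya measure referred to here compares two points of the topic simplex, i.e. two discrete distributions over latent topics. A minimal sketch of the standard Bhattacharyya distance (the surrounding paper's exact scoring setup is not shown in this snippet):

```python
import math

def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two points in the topic simplex.

    The Bhattacharyya coefficient sums sqrt(p_i * q_i) over topics;
    identical distributions give coefficient 1 and distance 0.
    """
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc)
```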
Novel weighting scheme for unsupervised language model adaptation using latent Dirichlet allocation
A new approach for computing weights of topic models in language model (LM) adaptation is introduced. We formed topic clusters by a hard-clustering method, assigning one topic to each document based on the maximum number of words chosen from a topic for that document in Latent Dirichlet Allocation (LDA) analysis. The new weighting idea is that the unigram count of the topic generated by hard-clus...
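The hard-clustering step described here reduces to an argmax over per-document topic word counts: each document is assigned the single topic that contributed most of its words. A minimal sketch under that reading; the input layout (one row of per-topic word counts per document) is an illustrative assumption.

```python
def hard_cluster_topics(doc_topic_word_counts):
    """Assign each document to the topic contributing the most of its words.

    Input: one row per document, one count per topic (e.g. from LDA).
    Output: the winning topic index for each document.
    """
    return [max(range(len(row)), key=row.__getitem__)
            for row in doc_topic_word_counts]
```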
Unsupervised Concept Annotation using Latent Dirichlet Allocation and Segmental Methods
Training efficient statistical approaches for natural language understanding generally requires data with segmental semantic annotations. Unfortunately, building such resources is costly. In this paper, we propose an approach that produces annotations in an unsupervised way. The first step is an implementation of latent Dirichlet allocation that produces a set of topics with probabilities for e...